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ABSTRACT 

The  rank  revealing  URV  decomposition  is  an  effective  tool  in  many  signal  processing  applications 
that  require  the  computation  of  the  noise  subspace  of  a  matrix.  In  this  paper,  we  consider  a  par¬ 
allel  architecture  for  updating  the  URV  decomposition  on  a  wavefront  array.  The  wavefront  array 
provides  an  efficient  real  time  mechanism  for  adaptive  computation  of  the  null  space  of  a  matrix  as 
well  as  for  handling  rank  changes  during  updating. 

1  INTRODUCTION 

Many  signal  processing  problems  require  the  computation  of  the  approximate  null  space  of  the 
data  arranged  in  the  form  of  a  n  x  p  matrix  A.  The  rows  of  the  matrix  represent  the  samples 
of  a  signal.  For  example,  in  applications  such  as  antenna  beamforming,  spectral  estimation  and 
direction  finding,  the  signal  and/or  noise  subspaces  need  to  be  estimated  [4],  The  singular  value 
decomposition  (SVD)  is  the  common  tool  used  to  estimate  the  null  space  of  a  matrix.  Besides 
the  fact  that  it  is  expensive  to  compute,  it  is  also  expensive  to  update  [1]  [2].  Most  of  the  known 
exact  updating  schemes  for  the  SVD  require  0(p3)  operations.  So  it  is  difficult  to  track  the  null 
space  of  a  matrix  using  SVD.  Another  approach  is  to  decompose  the  data  matrix,  using  a  rank 
revealing  version  of  the  QR  decomposition,  into  the  product  of  an  orthogonal  matrix,  an  upper 
triangular  matrix  and  a  permutation  matrix.  A  systolic  array  implementation  of  the  Chan-Foster 
RRQR  algorithm  has  been  discussed  in  [3]. 

Recently,  a  new  efficient  numerical  tool  was  introduced  that  can  update  the  null  space  of  a 
matrix  when  a  new  row  of  data  arrives.  The  rank  revealing  URV  decomposition  proposed  in 
[1]  is  an  intermediary  between  the  SVD  and  the  RRQR  that  can  be  updated  easily  using  0{p2) 
operations  while  still  providing  an  explicit  basis  for  the  null  space.  The  URV  decomposition  and 
the  QR  decomposition  constitute  powerful  computational  tools  for  array  processing  [1]  [4]. 

'This  work  is  supported  in  part  by  the  NSF  grants  MIP9457397  and  MIP9309506,  and  the  ONR  grant  N00014- 
93-11028. 
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In  this  paper  we  discuss  how  the  rank  revealing  variant  of  the  URV  decomposition  can  be 
updated  on  a  data-driven  wavefront  array.  The  emphasis  here  is  to  develop  an  architecture  that 
can  implement  both  the  URV  and  QR  decompositions  so  that  all  computational  issues  related  to 
adaptive  array  processing  can  be  performed  on  one  single  parallel  processing  environment. 


2  URV  UPDATE 


If  the  n  x  p  data  matrix  A  has  a  rank  k  with  the  singular  values  satisfying  a\  >  •  •  •  Ok  >  cr^+i  > 
•  •  •  >  <7p  with  <jfc  being  large  compared  to  cr^+i,  then  the  URV  decomposition  of  A  can  be  written 
as 


A  =  U 


R  F 
0  G 


VT. 


(1) 


In  the  above  equation,  U  and  V  are  n  x  n  and  p  x  p  orthogonal  matrices  respectively.  R  and 
G  are  upper  triangular  with  inf(i?)=  Also,  if  v  is  defined  by  v  —  ^/ff  F  ||2  +  ||  G  ||2,  then 
*  -  \[°Ui  +  •  •  •  +  Op.  In  order  to  distinguish  the  small  singular  values  from  the  large  ones,  a 
tolerance  tol  must  be  specified.  For  the  URV  decomposition  v  is  compared  with  tol  in  order  to 
separate  the  signal  space  and  the  noise  space. 

When  a  new  data  row  of  data  zT  is  available,  it  is  appended  to  the  original  data  matrix  A  as 
its  last  row  to  form  the  new  data  matrix.  The  problem  of  updating  the  URV  decomposition  is  that 
of  finding  the  decomposition  of  the  new  data  matrix  given  the  decomposition  of  A.  Since 


UT  0 
0  1 


A 


(2) 


where  (xT  yT^j  —  zTV,  the  problem  reduces  to  that  of  updating  A.  The  following  steps  are  required 
to  update  A: 

Step  1.  Compute  zTV  to  obtain  ( xT  yT^j. 

Step  2.  A  is  updated  in  one  of  the  following  ways  based  on  the  relative  values  of  tol,  v  and  ||  y  ||2. 

Case  2a)  If  \Jv+  ||  y  ||2  <  tol  ,  we  are  assured  the  rank  cannot  increase  and  A  can  be 
triangularized  by  a  sequence  of  left  rotations. 

Case  2b)  If  \Jv+  ||  y  ||2  >  tol,  an  increase  of  rank  occurs.  The  matrix  has  to  be  transformed 
to  an  upper  triangular  form  without  destroying  all  the  small  values  of  F  and  G.  This  is 
done  by  interleaving  left  rotations  with  right  rotations  for  the  yT  part  and  left  rotating 
the  xT  part  into  R.  In  this  case  the  V  matrix  must  be  updated  because  right  rotations 
are  involved. 

Note  that  all  rotations  referred  to  in  this  paper  correspond  to  Givens  rotations. 
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3  SYSTEM  COMPONENTS  AND  FUNCTIONS 

For  URV  decomposition,  a  wavefront  array  is  preferred  over  a  systolic  array  because  of  the  following 
reasons.  Firstly,  we  need  to  perform  interleaved  left  and  right  rotations  in  one  of  the  steps.  The 
computational  wavefronts  of  the  interleaved  left  and  right  rotations  do  not  flow  unhindered  but 
cross  each  other.  Also,  this  left  rotation  differs  from  the  left  rotations  in  the  Gentleman-Kung’s 
array  for  QRD  [6]  in  that  the  rotation  is  applied  to  the  ith  and  (i  +  l)th  rows  instead  of  between 
the  ith  row  and  the  last  row  of  the  augmented  data  matrix.  Thus  we  prefer  a  data  driven  approach. 
Secondly,  a  wavefront  array  provides  flexibility  for  programming[5]  .  This  means  that  the  functions 
of  cells  can  be  modified  easily  to  take  into  account  rank  changes  during  an  update. 

The  operation  consists  of  2  modes.  Mode  1  corresponds  to  Case  2a)  while  Mode  2  corresponds 
to  Case  2b).  The  matrix  V  is  stored  on  the  linear  array  of  processing  elements  (PE’s)  labeled 
V  in  Fig.  1  (one  column  on  each  processor).  The  data  zT  is  fed  serially  to  the  V  PE’s  at  the  fax 
left.  The  V  PE’s  multiply  zT  by  V  to  produce  the  vectors  xT  and  yT  which  are  then  transmitted 
upwards  to  the  RR  PE’s.  The  V  PE’s  also  compute  ||  y  ||2.  This  result  is  used  by  the  processor 
VO  at  the  extreme  right  end  of  the  linear  array  of  V  PE’s  to  decide  between  Mode  1  and  Mode 
2  operations  for  the  particular  update.  Finally,  the  V  PE’s  also  accumulate  the  right  rotations  in 
Mode  2  by  right  rotating  the  V  matrix. 

The  triangulax  part  of  the  array  stores  the  R,  F  and  G  matrices  (Fig.  1).  In  mode  1  the  R,  G 
and  F  PE’s  perform  left  rotations  while  the  RL  PE’s  generate  the  left  rotations  as  in  the  Gentleman- 
Kung  array.  In  mode  2  the  R  PE’s  perform  only  left  rotations,  the  G  PE’s  perform  interleaved 
left  and  right  rotations  and  the  F  PE’s  perform  only  right  rotations.  The  RL  PE’s  generate  the 
left  rotations  which  axe  propagated  along  the  row  whereas  the  right  rotations  generated  by  the  RR 
PE’s  axe  propagated  upwards  along  a  column. 

For  each  processor  we  need  to  specify  the  sequence  of  computations  that  it  needs  to  perform 
(see  Table  1).  The  computation  is  data-driven  rather  than  being  globally  synchronized  on  a  global 
clock.  This  means  that  a  processor  starts  a  particular  computation  when  the  data  required  for  it 
is  made  available  by  the  neighbouring  PE’s.  This  requires  some  sort  of  handshaking  between  PE’s. 

The  rank  of  the  data  matrix  may  change  during  an  update.  As  a  result  of  this  the  size  of  the 
R ,  F  and  G  matrices  can  change.  This  means  some  processors  have  to  change  their  function.  We 
introduced  what  we  refer  to  in  this  paper  as  rank-masking  to  handle  this  problem.  Each  processor 
stores  one  value  termed  the  rank-mask  that  determines  what  functions  it  performs  during  an  update. 
If  the  rank  changes,  then  we  must  be  able  to  update  the  rank-mask  to  reflect  the  operations  that 
the  PE’s  need  to  perform  during  the  next  update.  The  mode  of  operation  is  communicated  to  all 
PE’s  using  control  signals  in  order  to  facilitate  this  updating.  The  rank-mask  is  defined  as  follows. 
The  R  PE’s  have  a  mask  value  of  0.  Let  k  represent  the  initial  rank  of  the  data  matrix.  The  integral 
parts  of  the  mask  values  of  columns  ( k  +  1),  (k  +  2),  •••,(&  +  (jp  —  k ))  are  1, 2,  •  •  • ,  (p  —  k).  The  F 
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PE’s  have  an  additional  fractional  part  of  0.5.  The  control  pattern  that  updates  the  rank-mask 
when  the  rank  increases  from  k  =  3  to  k  =  4  is  shown  in  Fig.  2.  A  ”-1”  is  propagated  upwards 
on  each  column  that  has  non-zero  rank-mask  values  and  a  ”0”  is  propagated  up  the  columns  with 
zero  rank-mask  values.  Also,  the  RL  PE  with  rank-mask  1  sends  a  control  signal  of  ”0.5”  along 
their  rows  while  all  other  RL  PE’s  with  non-zero  rank- mask  send  a  ”0”  along  their  rows.  At  each 
PE,  the  control  signals  entering  it  from  below  (and  possibly  from  the  left)  are  summed  to  get  the 
new  rank-mask  value.  Under  the  additional  constraint  that,  if  the  newly  computed  rank-mask  is 
0.5,  then  it  is  modified  to  0,  the  update  is  completed  correctly. 

In  mode  1,  the  left  rotations  differ  from  those  performed  by  the  PE’s  in  the  Gentleman-Kung 
systolic  array  for  QRD  in  that,  the  new  data  is  fed  in  from  the  diagonal  elements  of  the  array  rather 
than  from  one  of  the  other  sides  [4].  Each  processor  (F,G  or  R)  receives  data  from  the  PE  below 
it  and  passes  the  data  upwards  until  it  reaches  the  top  of  the  array.  Then  the  processor  waits  for 
the  data  from  the  processor  above  it,  perform  left  rotations  on  the  data  and  then  passes  the  result 
to  the  PE  below  it.  The  RL  PE’s  generate  the  left  rotation  parameters  by  annihilating  the  data 
obtained  from  processors  above  them  and  then  transmit  the  left  rotation  parameters  along  the  row. 

In  mode  2,  the  RR  cells  receive  the  result  of  the  matrix-vector  multiplication  from  the  V  PE’s 
and  then  generate  the  right  rotation  parameters  by  annihilating  yin  in  the  pair  ( y  yin)  (See  Table 
1).  The  result  is  transmitted  to  the  next  adjacent  RR  processor  through  yout.  Note  that  the  Ith. 
column  of  PE’s  (consisting  of  F  and  G  PE’s)  perform  the  right  rotations  corresponding  to  the  Zth 
and  (l  +  l)th  columns.  Also  observe  that  the  G  PE’s  of  the  ith  row  perform  left  rotation  between 
the  ith  and  (i  +  l)th  rows  whereas  the  R  PE’s  perform  left  rotation  parameters  to  annihilate  the 
x  vector.  The  RL  PE’s  with  non-zero  rank-mask  values  generate  the  left  rotation  parameters  to 
annihilate  the  non-zero  sub-diagonal  elements  produced  by  the  right  rotation  while  the  other  RL 
PE’s  generate  left  rotation  parameters  to  annihilate  the  x  vector.  The  G  PE’s  perform  left  and 
right  rotations  using  the  cosine  and  sine  parameters  generated  by  the  RR  and  RL  PE’s  while  the 
F  PE’s  perform  only  the  right  rotations  that  are  generated  by  the  RR  PE’s.  The  details  are  shown 
in  Table  1. 

In  mode  2,  the  V  PE’s  that  have  a  non-zero  rank-mask  also  have  to  accumulate  the  right 
rotations  into  the  V  matrix  (i.e  they  have  to  perform  right  rotations  similar  to  the  F  PE’s).  We 
used  a  linear  array  for  doing  the  matrix  vector  multiplication  as  opposed  to  a  full  square  array  with 
each  processor  storing  one  element  of  V.  This  is  because  the  bottleneck  lies  in  the  computation  of 
||  y  ||2  (since  it  is  available  only  after  the  whole  matrix-vector  multiplication  is  completed),  so  there 
is  no  gain  in  further  pipelining  the  multiplication. 

4  PERFORMANCE  AND  CONCLUSION 

For  continuous  mode  1  operation,  if  the  serial  feed-in  of  the  first  row  of  data  is  started  at  time  0, 
then  after  carefully  analyzing  the  various  computational  wavefronts  we  find  that  the  next  set  of 
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data  can  be  piped  into  the  system  at  time  3 p.  The  3 p  delay  is  required  for  the  multiplication  to  be 
completed  and  then  for  the  control  signal  to  propagate  back  to  the  V  PE’s.  If  we  allow  the  control 
to  be  broadcast  to  all  RR  PE’s,  then  this  delay  can  be  reduced  to  2 p  (viz.  the  time  it  takes  to 
complete  the  matrix- vector  multiplication). 

The  performance  of  the  system  under  the  assumption  that  the  control  is  broadcast  to  all  RR 
PE’s  as  summarized  in  Table  2.  Note  that  we  have  assumed  that  each  operation  takes  unit  time. 
Each  entry  in  the  table  corresponds  to  the  time  taken  for  that  particular  step  when  there  is  no 
pipelining  between  various  steps.  The  entry  ’’the  total  time  when  pipelined”  takes  into  account 
two  kinds  of  pipelining,  one  between  the  steps  and  the  other  between  one  update  and  the  following 
update.  So  the  total  time  specified  under  mode  1  (i.e  2 p  —  1)  indicates  that  if  a  sequence  of  mode 
1  updates  are  performed  then  a  new  row  can  be  fed  in  every  2p  —  1  units  of  time.  In  mode  2  the 
corresponding  time  is  3p  —  1  because  in  this  case  the  rank  update  requires  control  propagation 
before  the  rotations  can  be  completed. 

It  is  also  interesting  to  note  that  this  implementation  (with  minor  modifications)  can  have 
broader  application  than  what  has  been  mentioned  in  this  paper.  For  example,  this  array  can  be 
used  to  implement  the  QRD-RLS  although  the  throughput  may  not  be  as  good  as  the  Gentleman- 
Kung’s  systolic  array.  All  we  need  is  an  additional  column  of  PE’s  to  the  extreme  right  to  compute 
the  residual  and  some  minor  modifications  to  processor  functions. 

In  this  paper  we  have  presented  a  VLSI  architecture  for  the  implementation  of  the  URV  update 
which  involves  complex  data  flows  that  cannot  be  easily  handled  by  a  systolic  design.  The  solution 
that  we  proposed  in  the  form  of  a  wavefront  array  has  the  additional  advantages  of  flexibility  and 
easy  programmability.  Computer  simulations  confirmed  the  results  presented  in  this  paper. 
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Processor  Type 
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Mode  2 
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[yout,cr,sr]=gen-rt[y  yin] 

[guout  grout]  -«£p  rot-rt[0  grin] 
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guout=gdin 
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if  rank-mask=0 
guout=gdin 
[g,cl,sl]=gen-lt[g  guin] 
else  [g  grout]  rot-rt[g  grin] 

[g,cl,sl]=gen-lt[g  gdin]  ;glout=g  ; 
g=glin  ;guout=g  ; 
g=guin 

G  gU0L 
glin  > 
(cl, si)-* 
glout 
gdi 

(cr.sr) 

H* 

JT1 

(cr.sr 

guin 

_grout 

— *■  (cl, si) 

"^grin 

gdout 

guout=gdin 

cl 

[g  gdout]  rot-lt[r  guin] 

[g  grout]  rot-rt[g  grin]  ;glout=g  ; 
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V 

) 
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if  rank-mask=0 

No  operation 

else 

for  (l<=i<=p) 

[v(i)  vrout]  rot-rt[v(i)  vrin]  ;vlout=v(i) ; 

v(i)=vlin ; 

_  Find  cr  and  sr  such  that: 

^  [<]=  [;][“  s]  [$]=[;][»  21] 


M-f-  ™-!i[.y]  =  [)]  =  [,?  d][;]  [x.c].sl]=gen-"(x  y]  ~ 


Find  cl  and  si  such  that: 


Table  1:  Cell  Operations  in  Different  Modes 


6 


Step  of 

Time  in 

Time  in 

URV  update 

Mode  1 

Mode  2 

Matrix  Multiplication 

2p  —  1 

2p  —  1 

Control  Propagation 

P 

V 

Rotations 

2p  —  1 

p  —  k 

Figure  1:  The  wavefront  array  for  URV  update 


